Training a Logistic Regression model with Gradient Descent optimization (binary classification)¶
- input features: $x_1, x_2, \ldots, x_n$ and an extra constant $x_0 = 1$ (bias)
- output feature: $\hat{y}$ - a probability prediction for being in Class 1 (being positive in case of a disease)
- weights $w_1, w_2, \ldots, w_n$ and a weight $w_0$ associated with the bias (intercept) are calculated in the training phase to minimize a loss function
- first we calculate the linear combination $z = w_0 + w_1 x_1 + \ldots + w_n x_n = \mathbf{w}^T \mathbf{x}$
- then we apply an activation function, typically a sigmoid function: $\hat{y} = \sigma(z) = \dfrac{1}{1 + e^{-z}}$
Cross entropy - loss function for binary classification¶
- $y$: binary (0-1) vector of the true categories
- $\hat{y}$: vector of the predictions with probabilities, $0 < \hat{y}_i < 1$
- Reminder: $L(y, \hat{y}) = -\dfrac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$
- without per-sample normalization: $L(y, \hat{y}) = -\sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$
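As a quick sanity check, the unnormalized formula above can be compared against sklearn's `log_loss` (a small sketch with made-up labels and probabilities):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.1, 0.8, 0.6, 0.3])

# cross entropy without per-sample normalization
manual = -np.sum(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(manual)
print(log_loss(y_true, y_prob, normalize=False))  # matches the manual sum
```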
from sklearn.metrics import log_loss
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline
names = ["Sample_code_number", "Clump_Thickness", "Uniformity_of_Cell_Size", "Uniformity_of_Cell_Shape",
"Marginal_Adhesion", "Single_Epithelial_Cell_Size", "Bare_Nuclei", "Bland_Chromatin",
"Normal_Nucleoli", "Mitoses", "Class"]
#df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",
# names=names,
# na_values="?")
df = pd.read_csv("breast-cancer-wisconsin.data",
names=names,
na_values="?")
df["Bare_Nuclei"] = df["Bare_Nuclei"].fillna(df["Bare_Nuclei"].median())
df["Bare_Nuclei"] = df["Bare_Nuclei"].astype('int64')
df = df.set_index("Sample_code_number")
X = df[names[1:-1]].values
y = df[names[-1]].values // 2 - 1
X.shape, y.shape
((699, 9), (699,))
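In the raw data the Class column uses the codes 2 (benign) and 4 (malignant); the integer expression above maps them to 0/1. A quick sketch of the mapping:

```python
import numpy as np

# Class labels in the raw data: 2 (benign) and 4 (malignant)
raw = np.array([2, 4, 4, 2])
mapped = raw // 2 - 1   # 2 -> 0, 4 -> 1
print(mapped)           # [0 1 1 0]
```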
### Prepend the constant 1 (bias) column to X
num_samples = X.shape[0]
num_features = X.shape[1]
X = np.hstack((np.ones((num_samples, 1)), X))
X.shape
(699, 10)
### Create the necessary functions to
### predict from (X, w) and calculate the gradient from (X, w, y)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def feedforward(X, w):
    z = X @ w
    y_pred = sigmoid(z)
    return y_pred

def backprop(X, w, y):
    y_pred = feedforward(X, w)
    delta = y_pred - y
    gradient = X.T @ delta
    return gradient
w0 = np.zeros(num_features + 1)
gradient_true = backprop(X, w0, y)
gradient_true
array([ 108.5, -190. , -488.5, -460. , -356. , -153. , -606.5, -239.5,
-411. , -68.5])
Do a sanity check on the gradient¶
Reminder: in case of a two-variable function $f(x, y)$, for a small $h$:

$\dfrac{\partial f}{\partial x}(x, y) \approx \dfrac{f(x + h, y) - f(x, y)}{h}$
h = 0.0000001
w0 = np.zeros(num_features + 1)
y_pred = feedforward(X, w0)
f0 = log_loss(y, y_pred, normalize=False)
print(f0)
idx = 9
w0[idx] += h
y_pred = feedforward(X, w0)
f1 = log_loss(y, y_pred, normalize=False)
print(f1)
print(f"Gradient approximation:\t{(f1 - f0) / h}")
print(f"True gradient:\t\t{gradient_true[idx]}")
484.50987921140177
484.50987236140645
Gradient approximation:	-68.49995315860724
True gradient:		-68.5
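The check above probes a single coordinate. A self-contained sketch (with copies of the functions above and hypothetical random data) that verifies every coordinate of the gradient:

```python
import numpy as np
from sklearn.metrics import log_loss

# mirror the notebook's functions
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def feedforward(X, w):
    return sigmoid(X @ w)

def backprop(X, w, y):
    return X.T @ (feedforward(X, w) - y)

# small synthetic problem, just for the check
rs = np.random.RandomState(0)
X = np.hstack((np.ones((20, 1)), rs.rand(20, 3)))
y = np.array([0, 1] * 10)

h = 1e-7
w0 = np.zeros(X.shape[1])
grad = backprop(X, w0, y)
f0 = log_loss(y, feedforward(X, w0), normalize=False)

# perturb each coordinate and compare the finite difference to the analytic gradient
for idx in range(len(w0)):
    w = w0.copy()
    w[idx] += h
    f1 = log_loss(y, feedforward(X, w), normalize=False)
    approx = (f1 - f0) / h
    assert abs(approx - grad[idx]) < 1e-3, (idx, approx, grad[idx])
print("gradient check passed for all coordinates")
```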
from IPython.display import Image
Image(filename='fv_3d.png')
Image("contours.png")
Image("grad_desc.png", width=600)
from tqdm import tqdm
def logreg_fit(X, y, learning_rate=0.001, num_epochs=10):
    loss_hist = []
    w = np.zeros(X.shape[1])
    for _ in tqdm(range(num_epochs)):
        gradient = backprop(X, w, y)
        w = w - gradient * learning_rate
        y_pred = feedforward(X, w)
        loss = log_loss(y, y_pred)
        loss_hist.append(loss)
    return w, loss_hist
w_opt, loss_hist = logreg_fit(X, y, 1e-4, 10000)
plt.plot(loss_hist)
plt.xlabel("Number of epochs")
plt.ylabel("Log loss")
plt.title(f"Loss history final loss: {loss_hist[-1]}");
Try out different learning rate parameters: smaller, medium, larger ones¶
w_opt, loss_hist = logreg_fit(X, y, 1e-5, 10000)
plt.plot(loss_hist)
plt.xlabel("Number of epochs")
plt.ylabel("Log loss")
plt.title(f"Loss history final loss: {loss_hist[-1]}");
w_opt, loss_hist = logreg_fit(X, y, 1e-4, 10000)
plt.plot(loss_hist)
plt.xlabel("Number of epochs")
plt.ylabel("Log loss")
plt.title(f"Loss history final loss: {loss_hist[-1]}");
w_opt, loss_hist = logreg_fit(X, y, 1e-3, 10000)
plt.plot(loss_hist)
plt.xlabel("Number of epochs")
plt.ylabel("Log loss")
plt.title(f"Loss history final loss: {loss_hist[-1]}");
w_opt, loss_hist = logreg_fit(X, y, 1e-2, 10000)
plt.plot(loss_hist)
plt.xlabel("Number of epochs")
plt.ylabel("Log loss")
plt.title(f"Loss history final loss: {loss_hist[-1]}");
w_opt, loss_hist = logreg_fit(X, y, 1e-1, 50)
plt.plot(loss_hist)
plt.xlabel("Number of epochs")
plt.ylabel("Log loss")
plt.title(f"Loss history final loss: {loss_hist[-1]}");
/var/folders/sq/_vdvf2nn51nbbtm87hrx368h0000gn/T/ipykernel_98039/3481577907.py:5: RuntimeWarning: overflow encountered in exp
  return 1 / (1 + np.exp(-z))
(this warning is repeated on many iterations: with this large learning rate the weights diverge and np.exp overflows)
w_opt, loss_hist = logreg_fit(X, y, 3e-4, 10000)
plt.plot(loss_hist)
plt.xlabel("Number of epochs")
plt.ylabel("Log loss")
plt.title(f"Loss history final loss: {loss_hist[-1]}");
Stochastic Gradient Descent (SGD) algorithm¶
- Shuffle the samples randomly
- iterate through the sum in the loss by mini-batches and update the weights after every single batch
rs = np.random.RandomState(42)
rs.choice(10, size=10, replace=False)
array([8, 1, 5, 0, 7, 2, 9, 4, 3, 6])
from tqdm import tqdm
def logreg_sgd_fit(X, y, learning_rate=0.001, num_epochs=10, batch_size=32, random_state=42):
    rs = np.random.RandomState(random_state)
    loss_hist = []
    w = np.zeros(X.shape[1])
    num_samples = X.shape[0]
    for _ in tqdm(range(num_epochs)):
        permutation = rs.choice(num_samples, size=num_samples, replace=False)
        X = X[permutation]
        y = y[permutation]
        for idx in range(num_samples // batch_size):
            X_batch = X[idx * batch_size: (idx + 1) * batch_size]
            y_batch = y[idx * batch_size: (idx + 1) * batch_size]
            gradient = backprop(X_batch, w, y_batch)
            w = w - gradient * learning_rate
            y_pred = feedforward(X, w)
            loss = log_loss(y, y_pred)
            loss_hist.append(loss)
    return w, loss_hist
w_opt, loss_hist = logreg_sgd_fit(X, y, 3e-3, 100, batch_size=32)
plt.plot(loss_hist)
plt.xlabel("Number of iterations")
plt.ylabel("Log loss")
plt.title(f"Loss history final loss: {loss_hist[-1]}");
w_opt, loss_hist = logreg_sgd_fit(X, y, 1e-3, 100, batch_size=32)
plt.plot(loss_hist)
plt.xlabel("Number of iterations")
plt.ylabel("Log loss")
plt.title(f"Loss history final loss: {loss_hist[-1]}");
w_opt, loss_hist = logreg_sgd_fit(X, y, 1e-1, 100, batch_size=32)
plt.plot(loss_hist)
plt.xlabel("Number of iterations")
plt.ylabel("Log loss")
plt.title(f"Loss history final loss: {loss_hist[-1]}");
### Apply some smoothing on the loss curves
w_opt, loss_hist = logreg_sgd_fit(X, y, 1e-3, 100, batch_size=32)
plt.plot(loss_hist)
plt.xlabel("Number of iterations")
plt.ylabel("Log loss")
loss_smoothed = np.convolve(loss_hist, np.ones(100) / 100, mode="valid")
plt.plot(range(100, len(loss_smoothed) + 100), loss_smoothed, "r-")
plt.title(f"Loss history final loss: {loss_hist[-1]}");
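An alternative to `np.convolve` is a pandas rolling mean, which keeps the smoothed curve aligned with the iteration index (a sketch on a hypothetical noisy curve standing in for `loss_hist`):

```python
import numpy as np
import pandas as pd

# hypothetical noisy loss curve standing in for loss_hist
loss_hist = 1 / (1 + np.arange(1000)) + 0.01 * np.random.RandomState(0).randn(1000)

# rolling mean: the first window-1 entries are NaN, the rest stay index-aligned
loss_smoothed = pd.Series(loss_hist).rolling(window=100).mean()
print(loss_smoothed.dropna().shape)  # (901,)
```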
### Compare the results to the sklearn.linear_model implementation
w_opt, loss_hist = logreg_fit(X, y, 3e-4, 100000)
w_opt
array([-9.71454392, 0.53464755, 0.01128227, 0.32376783, 0.23762062,
0.05832409, 0.42816088, 0.41212863, 0.15824303, 0.53584273])
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=10000)  # large C: very weak regularization, to match our unregularized fit
model.fit(X[:, 1:], y)
model.intercept_, model.coef_
(array([-9.72392166]),
array([[0.53527556, 0.010504 , 0.32491021, 0.23780949, 0.0579741 ,
0.4285776 , 0.41241511, 0.15827911, 0.53749624]]))
Neural network training - backpropagation¶
Image(filename='nn.png')
Backpropagation algorithm¶
How can we optimize the weights in a general neural network? We need gradients, i.e. the partial derivatives with respect to all of the weights.
Let's try to understand the calculation with a simple example. We have a 2D binary classification problem with 1 hidden layer of neurons. We need two weight matrices: $W^{(1)}$ for the hidden layer and $\mathbf{w}^{(2)}$ for the output neuron. Forward propagation just consists of matrix-vector multiplications and applying the activation functions:

$\mathbf{a} = \sigma(W^{(1)} \mathbf{x}), \qquad \hat{y} = \sigma(\mathbf{w}^{(2)} \cdot \mathbf{a})$
The contribution of a single sample to the log loss function:

$\ell = -\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right]$
We have to calculate the partial derivatives of $\ell$ with respect to the entries of $W^{(1)}$ and $\mathbf{w}^{(2)}$. It is easier to start backwards with the output weights and apply the chain rule:

$\dfrac{\partial \ell}{\partial w^{(2)}_j} = \dfrac{\partial \ell}{\partial \hat{y}} \dfrac{\partial \hat{y}}{\partial z^{(2)}} \dfrac{\partial z^{(2)}}{\partial w^{(2)}_j} = (\hat{y} - y)\, a_j$

Now we can get the derivatives w.r.t. the entries of $W^{(1)}$ by using a massive amount of chain rules:

$\dfrac{\partial \ell}{\partial W^{(1)}_{jk}} = (\hat{y} - y)\, w^{(2)}_j\, a_j (1 - a_j)\, x_k$
Short summary:
- using a given set of weights we can use feedforward calculation to get the activations and outputs
- we calculate the error on the output layer: $\delta^{(\mathrm{out})} = \hat{y} - y$
- we propagate the errors backwards to calculate the $\delta$ values of the earlier layers
- from the deltas we can simply get the partial derivatives: $\dfrac{\partial \ell}{\partial w_{jk}} = \delta_j\, a_k$
- then we have the gradient and can apply any gradient-based optimization method
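The steps above can be sketched for a one-hidden-layer network with sigmoid activations (hypothetical sizes and data; biases omitted for brevity), including a numerical check of one partial derivative:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# hypothetical sizes: 2 inputs, 3 hidden neurons, 1 output
rs = np.random.RandomState(42)
W1 = rs.randn(3, 2)          # hidden-layer weights
w2 = rs.randn(3)             # output-layer weights
x = np.array([0.5, -1.2])    # one sample
y = 1.0                      # its true label

def forward(W1, w2, x):
    a = sigmoid(W1 @ x)       # hidden activations
    y_hat = sigmoid(w2 @ a)   # output probability
    return a, y_hat

def loss(W1, w2, x, y):
    _, y_hat = forward(W1, w2, x)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# backpropagation: output delta, then hidden deltas, then gradients
a, y_hat = forward(W1, w2, x)
delta_out = y_hat - y                        # error on the output layer
grad_w2 = delta_out * a                      # d loss / d w2
delta_hidden = delta_out * w2 * a * (1 - a)  # errors propagated backwards
grad_W1 = np.outer(delta_hidden, x)          # d loss / d W1

# numerical sanity check on one entry of W1
h = 1e-7
W1p = W1.copy()
W1p[0, 0] += h
approx = (loss(W1p, w2, x, y) - loss(W1, w2, x, y)) / h
assert abs(approx - grad_W1[0, 0]) < 1e-4
print("backprop gradient matches the numerical approximation")
```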